ggml: aarch64: implement SVE kernels for q8_0_q8_0, q4_0_q8_0 vector dot by msy-kato · Pull Request #7433 · ggml-org/llama.cpp

msy-kato · 2024-05-21T07:57:34Z

This PR introduces support for SVE(Scalable Vector Extensions) kernels for the q4_0_q8_0 and q4_0_q8_0 vector dot on the Arm architecture. A similar proposal for SVE support is made in PR #5780, but it also includes changes to the block layout.

This PR implements the SVE vector dot with minimal changes as a first SVE support. The performance enhancement is less than that of PR #5780, but it is ~ x1.1 to x1.5 faster than the original implementation.

SVE is enabled if LLAMA_SVE=ON is set in cmake. Here is an example of the compilation commands:

$ cmake -DLLAMA_SVE=ON -B build -S .
$ cmake --build build -j$(($(nproc)/2))

Here are the performance measured on AWS Graviton3E (hpc7g).

### Q4_0_Q8_0
$  ./build/bin/main --model models/llama-2-7b-chat.Q4_0.gguf --temp 0.1 --threads 2 --prompt 'AI is going to' --n-predict 512 --seed 0 --prompt-cache llama-2-7b-chat.Q4_0.gguf-prompt.bin

### Q8_0_Q8_0
$  ./build/bin/main --model models/llama-2-7b-chat.Q8_0.gguf --temp 0.1 --threads 2 --prompt 'AI is going to' --n-predict 512 --seed 0 --prompt-cache llama-2-7b-chat.Q8_0.gguf-prompt.bin

Q4_0_Q8_0

Decoding throughput[token/sec]

Threads	Original(NEON)	This PR(SVE)	Ratio
2	3.16	4.05	1.28
4	6.21	7.88	1.27
8	11.92	14.81	1.24
16	21.54	25.77	1.20
32	32.38	36.21	1.12

Q8_0_Q8_0

Decoding throughput[token/sec]

Threads	Original(NEON)	This PR(SVE)	Ratio
2	3.14	4.60	1.46
4	6.10	8.97	1.47
8	11.46	16.29	1.42
16	20.20	23.77	1.18
32	24.72	26.01	1.05

Limitation: This pull request only supports SVE 256-bit.

github-actions · 2024-05-21T08:26:42Z

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 527 iterations 🚀

Expand details for performance related PR only

Concurrent users: 8, duration: 10m
HTTP request : avg=8868.41ms p(95)=22845.51ms fails=, finish reason: stop=470 truncated=57
Prompt processing (pp): avg=103.39tk/s p(95)=462.02tk/s
Token generation (tg): avg=47.79tk/s p(95)=48.22tk/s
ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=feat-sve-q4_0_q8_0-q8_0_q8_0 commit=d28bfd5ef7492548d6e000b6ad2cb6042161ec95

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 527 iterations"
    y-axis "llamacpp:prompt_tokens_seconds"
    x-axis "llamacpp:prompt_tokens_seconds" 1716476431 --> 1716477057
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 370.5, 370.5, 370.5, 370.5, 370.5, 897.45, 897.45, 897.45, 897.45, 897.45, 931.93, 931.93, 931.93, 931.93, 931.93, 910.64, 910.64, 910.64, 910.64, 910.64, 913.93, 913.93, 913.93, 913.93, 913.93, 933.06, 933.06, 933.06, 933.06, 933.06, 918.94, 918.94, 918.94, 918.94, 918.94, 911.73, 911.73, 911.73, 911.73, 911.73, 911.08, 911.08, 911.08, 911.08, 911.08, 924.12, 924.12, 924.12, 924.12, 924.12, 918.9, 918.9, 918.9, 918.9, 918.9, 924.42, 924.42, 924.42, 924.42, 924.42, 940.48, 940.48, 940.48, 940.48, 940.48, 928.43, 928.43, 928.43, 928.43, 928.43, 922.96, 922.96, 922.96, 922.96, 922.96, 938.31, 938.31, 938.31, 938.31, 938.31, 872.83, 872.83, 872.83, 872.83, 872.83, 877.28, 877.28, 877.28, 877.28, 877.28, 878.68, 878.68, 878.68, 878.68, 878.68, 875.02, 875.02, 875.02, 875.02, 875.02, 831.24, 831.24, 831.24, 831.24, 831.24, 831.32, 831.32, 831.32, 831.32, 831.32, 827.98, 827.98, 827.98, 827.98, 827.98, 836.54, 836.54, 836.54, 836.54, 836.54, 838.99, 838.99, 838.99, 838.99, 838.99, 814.64, 814.64, 814.64, 814.64, 814.64, 817.76, 817.76, 817.76, 817.76, 817.76, 814.19, 814.19, 814.19, 814.19, 814.19, 762.91, 762.91, 762.91, 762.91, 762.91, 763.24, 763.24, 763.24, 763.24, 763.24, 764.72, 764.72, 764.72, 764.72, 764.72, 771.64, 771.64, 771.64, 771.64, 771.64, 772.88, 772.88, 772.88, 772.88, 772.88, 771.75, 771.75, 771.75, 771.75, 771.75, 782.64, 782.64, 782.64, 782.64, 782.64, 784.09, 784.09, 784.09, 784.09, 784.09, 790.53, 790.53, 790.53, 790.53, 790.53, 787.93, 787.93, 787.93, 787.93, 787.93, 785.44, 785.44, 785.44, 785.44, 785.44, 785.12, 785.12, 785.12, 785.12, 785.12, 784.82, 784.82, 784.82, 784.82, 784.82, 789.81, 789.81, 789.81, 789.81, 789.81, 791.52, 791.52, 791.52, 791.52, 791.52, 788.09, 788.09, 788.09, 788.09, 788.09, 790.83, 790.83, 790.83, 790.83, 790.83, 750.24, 750.24, 750.24, 750.24, 750.24, 749.28, 749.28, 749.28, 749.28, 749.28, 748.96, 748.96, 748.96, 748.96, 748.96, 749.96, 749.96, 749.96, 749.96, 749.96, 750.88, 750.88, 750.88, 750.88, 750.88, 750.83, 750.83, 750.83, 750.83, 750.83, 751.75, 751.75, 751.75, 751.75, 751.75, 756.07, 756.07, 756.07, 756.07, 756.07, 758.37, 758.37, 758.37, 758.37, 758.37, 764.0, 764.0, 764.0, 764.0, 764.0, 763.16, 763.16, 763.16, 763.16, 763.16, 765.44, 765.44, 765.44, 765.44, 765.44, 765.42, 765.42, 765.42, 765.42, 765.42, 767.66, 767.66, 767.66, 767.66, 767.66, 768.26, 768.26, 768.26, 768.26, 768.26, 770.42, 770.42, 770.42, 770.42]

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 527 iterations"
    y-axis "llamacpp:predicted_tokens_seconds"
    x-axis "llamacpp:predicted_tokens_seconds" 1716476431 --> 1716477057
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 41.99, 41.99, 41.99, 41.99, 41.99, 34.85, 34.85, 34.85, 34.85, 34.85, 30.34, 30.34, 30.34, 30.34, 30.34, 27.49, 27.49, 27.49, 27.49, 27.49, 25.73, 25.73, 25.73, 25.73, 25.73, 26.42, 26.42, 26.42, 26.42, 26.42, 26.91, 26.91, 26.91, 26.91, 26.91, 27.62, 27.62, 27.62, 27.62, 27.62, 28.31, 28.31, 28.31, 28.31, 28.31, 29.25, 29.25, 29.25, 29.25, 29.25, 29.46, 29.46, 29.46, 29.46, 29.46, 30.28, 30.28, 30.28, 30.28, 30.28, 30.41, 30.41, 30.41, 30.41, 30.41, 30.66, 30.66, 30.66, 30.66, 30.66, 30.51, 30.51, 30.51, 30.51, 30.51, 30.22, 30.22, 30.22, 30.22, 30.22, 28.99, 28.99, 28.99, 28.99, 28.99, 28.36, 28.36, 28.36, 28.36, 28.36, 28.43, 28.43, 28.43, 28.43, 28.43, 28.95, 28.95, 28.95, 28.95, 28.95, 28.89, 28.89, 28.89, 28.89, 28.89, 28.86, 28.86, 28.86, 28.86, 28.86, 28.43, 28.43, 28.43, 28.43, 28.43, 28.65, 28.65, 28.65, 28.65, 28.65, 28.94, 28.94, 28.94, 28.94, 28.94, 28.98, 28.98, 28.98, 28.98, 28.98, 29.14, 29.14, 29.14, 29.14, 29.14, 29.41, 29.41, 29.41, 29.41, 29.41, 29.48, 29.48, 29.48, 29.48, 29.48, 29.41, 29.41, 29.41, 29.41, 29.41, 29.56, 29.56, 29.56, 29.56, 29.56, 29.87, 29.87, 29.87, 29.87, 29.87, 29.89, 29.89, 29.89, 29.89, 29.89, 30.18, 30.18, 30.18, 30.18, 30.18, 30.3, 30.3, 30.3, 30.3, 30.3, 30.34, 30.34, 30.34, 30.34, 30.34, 30.22, 30.22, 30.22, 30.22, 30.22, 30.04, 30.04, 30.04, 30.04, 30.04, 29.7, 29.7, 29.7, 29.7, 29.7, 29.61, 29.61, 29.61, 29.61, 29.61, 29.7, 29.7, 29.7, 29.7, 29.7, 29.95, 29.95, 29.95, 29.95, 29.95, 29.98, 29.98, 29.98, 29.98, 29.98, 30.12, 30.12, 30.12, 30.12, 30.12, 29.9, 29.9, 29.9, 29.9, 29.9, 29.83, 29.83, 29.83, 29.83, 29.83, 29.46, 29.46, 29.46, 29.46, 29.46, 29.2, 29.2, 29.2, 29.2, 29.2, 28.56, 28.56, 28.56, 28.56, 28.56, 28.53, 28.53, 28.53, 28.53, 28.53, 28.53, 28.53, 28.53, 28.53, 28.53, 28.56, 28.56, 28.56, 28.56, 28.56, 28.55, 28.55, 28.55, 28.55, 28.55, 28.65, 28.65, 28.65, 28.65, 28.65, 28.66, 28.66, 28.66, 28.66, 28.66, 28.57, 28.57, 28.57, 28.57, 28.57, 28.55, 28.55, 28.55, 28.55, 28.55, 28.49, 28.49, 28.49, 28.49, 28.49, 28.58, 28.58, 28.58, 28.58, 28.58, 28.83, 28.83, 28.83, 28.83, 28.83, 28.87, 28.87, 28.87, 28.87]

Details

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 527 iterations"
    y-axis "llamacpp:kv_cache_usage_ratio"
    x-axis "llamacpp:kv_cache_usage_ratio" 1716476431 --> 1716477057
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.08, 0.08, 0.08, 0.08, 0.08, 0.42, 0.42, 0.42, 0.42, 0.42, 0.48, 0.48, 0.48, 0.48, 0.48, 0.41, 0.41, 0.41, 0.41, 0.41, 0.18, 0.18, 0.18, 0.18, 0.18, 0.19, 0.19, 0.19, 0.19, 0.19, 0.26, 0.26, 0.26, 0.26, 0.26, 0.3, 0.3, 0.3, 0.3, 0.3, 0.13, 0.13, 0.13, 0.13, 0.13, 0.21, 0.21, 0.21, 0.21, 0.21, 0.1, 0.1, 0.1, 0.1, 0.1, 0.21, 0.21, 0.21, 0.21, 0.21, 0.11, 0.11, 0.11, 0.11, 0.11, 0.33, 0.33, 0.33, 0.33, 0.33, 0.25, 0.25, 0.25, 0.25, 0.25, 0.44, 0.44, 0.44, 0.44, 0.44, 0.34, 0.34, 0.34, 0.34, 0.34, 0.21, 0.21, 0.21, 0.21, 0.21, 0.15, 0.15, 0.15, 0.15, 0.15, 0.16, 0.16, 0.16, 0.16, 0.16, 0.35, 0.35, 0.35, 0.35, 0.35, 0.38, 0.38, 0.38, 0.38, 0.38, 0.12, 0.12, 0.12, 0.12, 0.12, 0.17, 0.17, 0.17, 0.17, 0.17, 0.12, 0.12, 0.12, 0.12, 0.12, 0.11, 0.11, 0.11, 0.11, 0.11, 0.13, 0.13, 0.13, 0.13, 0.13, 0.17, 0.17, 0.17, 0.17, 0.17, 0.29, 0.29, 0.29, 0.29, 0.29, 0.22, 0.22, 0.22, 0.22, 0.22, 0.11, 0.11, 0.11, 0.11, 0.11, 0.12, 0.12, 0.12, 0.12, 0.12, 0.18, 0.18, 0.18, 0.18, 0.18, 0.14, 0.14, 0.14, 0.14, 0.14, 0.16, 0.16, 0.16, 0.16, 0.16, 0.19, 0.19, 0.19, 0.19, 0.19, 0.26, 0.26, 0.26, 0.26, 0.26, 0.33, 0.33, 0.33, 0.33, 0.33, 0.25, 0.25, 0.25, 0.25, 0.25, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.14, 0.14, 0.14, 0.14, 0.14, 0.14, 0.14, 0.14, 0.14, 0.14, 0.34, 0.34, 0.34, 0.34, 0.34, 0.43, 0.43, 0.43, 0.43, 0.43, 0.58, 0.58, 0.58, 0.58, 0.58, 0.49, 0.49, 0.49, 0.49, 0.49, 0.47, 0.47, 0.47, 0.47, 0.47, 0.22, 0.22, 0.22, 0.22, 0.22, 0.22, 0.22, 0.22, 0.22, 0.22, 0.28, 0.28, 0.28, 0.28, 0.28, 0.11, 0.11, 0.11, 0.11, 0.11, 0.2, 0.2, 0.2, 0.2, 0.2, 0.25, 0.25, 0.25, 0.25, 0.25, 0.31, 0.31, 0.31, 0.31, 0.31, 0.23, 0.23, 0.23, 0.23, 0.23, 0.25, 0.25, 0.25, 0.25, 0.25, 0.14, 0.14, 0.14, 0.14, 0.14, 0.13, 0.13, 0.13, 0.13, 0.13, 0.08, 0.08, 0.08, 0.08, 0.08, 0.09, 0.09, 0.09, 0.09]

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 527 iterations"
    y-axis "llamacpp:requests_processing"
    x-axis "llamacpp:requests_processing" 1716476431 --> 1716477057
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.0, 5.0, 5.0, 5.0, 5.0, 4.0, 4.0, 4.0, 4.0, 4.0, 2.0, 2.0, 2.0, 2.0, 2.0, 5.0, 5.0, 5.0, 5.0, 5.0, 1.0, 1.0, 1.0, 1.0, 1.0, 7.0, 7.0, 7.0, 7.0, 7.0, 2.0, 2.0, 2.0, 2.0, 2.0, 5.0, 5.0, 5.0, 5.0, 5.0, 6.0, 6.0, 6.0, 6.0, 6.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 6.0, 6.0, 6.0, 6.0, 6.0, 4.0, 4.0, 4.0, 4.0, 4.0, 8.0, 8.0, 8.0, 8.0, 8.0, 2.0, 2.0, 2.0, 2.0, 2.0, 5.0, 5.0, 5.0, 5.0, 5.0, 8.0, 8.0, 8.0, 8.0, 8.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 3.0, 3.0, 3.0, 3.0, 3.0, 2.0, 2.0, 2.0, 2.0, 2.0, 4.0, 4.0, 4.0, 4.0, 4.0, 5.0, 5.0, 5.0, 5.0, 5.0, 8.0, 8.0, 8.0, 8.0, 8.0, 4.0, 4.0, 4.0, 4.0, 4.0, 5.0, 5.0, 5.0, 5.0, 5.0, 4.0, 4.0, 4.0, 4.0, 4.0, 7.0, 7.0, 7.0, 7.0, 7.0, 4.0, 4.0, 4.0, 4.0, 4.0, 5.0, 5.0, 5.0, 5.0, 5.0, 7.0, 7.0, 7.0, 7.0, 7.0, 6.0, 6.0, 6.0, 6.0, 6.0, 8.0, 8.0, 8.0, 8.0, 8.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 5.0, 5.0, 5.0, 5.0, 5.0, 2.0, 2.0, 2.0, 2.0, 2.0, 5.0, 5.0, 5.0, 5.0, 5.0, 8.0, 8.0, 8.0, 8.0, 8.0, 4.0, 4.0, 4.0, 4.0, 4.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 4.0, 4.0, 4.0, 4.0, 4.0, 6.0, 6.0, 6.0, 6.0, 6.0, 7.0, 7.0, 7.0, 7.0, 7.0, 5.0, 5.0, 5.0, 5.0, 5.0, 8.0, 8.0, 8.0, 8.0, 8.0, 4.0, 4.0, 4.0, 4.0, 4.0, 6.0, 6.0, 6.0, 6.0, 6.0, 5.0, 5.0, 5.0, 5.0, 5.0, 8.0, 8.0, 8.0, 8.0, 8.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 1.0, 1.0, 1.0, 1.0, 1.0, 6.0, 6.0, 6.0, 6.0]

ggerganov

Could you demonstrate short perplexity runs produce reasonable values compared to no-SVE?

msy-kato · 2024-05-24T05:32:55Z

Thanks for the comment! I ran perplexity with SVE and no-SVE. The following is the command and partial logs.

### Q8_0 / no-SVE
$ ./build-neon/bin/perplexity -s 0 -np 1 -t 32 -m llama-2-7b-chat.Q8_0.gguf -f wikitext-2-raw/wiki.test.raw -c 128 -b 128 --chunks 4
---
(...snip)
system_info: n_threads = 32 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 906.69 ms
perplexity: calculating perplexity over 4 chunks, n_ctx=128, batch_size=128, n_seq=1
perplexity: 2.47 seconds per pass - ETA 0.15 minutes
[1]5.2130,[2]7.4447,[3]7.4725,[4]8.4178,
Final estimate: PPL = 8.4178 +/- 1.61226

llama_print_timings:        load time =     314.22 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =    9876.98 ms /   512 tokens (   19.29 ms per token,    51.84 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =   10796.42 ms /   513 tokens

### Q8_0 / SVE
$ ./build-sve/bin/perplexity -s 0 -np 1 -t 32 -m llama-2-7b-chat.Q8_0.gguf -f wikitext-2-raw/wiki.test.raw -c 128 -b 128 --chunks 4
---
(...snip)
system_info: n_threads = 32 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 1 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 915.193 ms
perplexity: calculating perplexity over 4 chunks, n_ctx=128, batch_size=128, n_seq=1
perplexity: 0.99 seconds per pass - ETA 0.05 minutes
[1]5.2291,[2]7.4493,[3]7.4706,[4]8.4219,
Final estimate: PPL = 8.4219 +/- 1.61261

llama_print_timings:        load time =     304.68 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =    3940.02 ms /   512 tokens (    7.70 ms per token,   129.95 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =    4868.40 ms /   513 tokens

### Q4_0 / no-SVE
$ ./build-neon/bin/perplexity -s 0 -np 1 -t 32 -m llama-2-7b-chat.Q4_0.gguf -f wikitext-2-raw/wiki.test.raw -c 128 -b 128 --chunks 4
---
(...snip)
system_info: n_threads = 32 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 898.157 ms
perplexity: calculating perplexity over 4 chunks, n_ctx=128, batch_size=128, n_seq=1
perplexity: 2.53 seconds per pass - ETA 0.17 minutes
[1]5.4426,[2]7.4845,[3]7.9395,[4]9.0525,
Final estimate: PPL = 9.0525 +/- 1.80378

llama_print_timings:        load time =   13751.66 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =   10110.36 ms /   512 tokens (   19.75 ms per token,    50.64 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =   11021.03 ms /   513 tokens

### Q4_0 / SVE
$ ./build-sve/bin/perplexity -s 0 -np 1 -t 32 -m llama-2-7b-chat.Q4_0.gguf -f wikitext-2-raw/wiki.test.raw -c 128 -b 128 --chunks 4
---
(...snip)
system_info: n_threads = 32 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 1 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 901.443 ms
perplexity: calculating perplexity over 4 chunks, n_ctx=128, batch_size=128, n_seq=1
perplexity: 1.09 seconds per pass - ETA 0.07 minutes
[1]5.4306,[2]7.4762,[3]7.9293,[4]9.0456,
Final estimate: PPL = 9.0456 +/- 1.80407

llama_print_timings:        load time =     184.21 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =    4340.33 ms /   512 tokens (    8.48 ms per token,   117.96 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =    5254.53 ms /   513 tokens

And below is a summary.

SIMD	Type	PPL	total time[ms]
NEON	Q8_0	8.4178 +/- 1.61226	10796.42
SVE	Q8_0	8.4219 +/- 1.61261	4868.4
NEON	Q4_0	9.0525 +/- 1.80378	11021.03
SVE	Q4_0	9.0456 +/- 1.80407	5254.53

This correction does not appear to have any impact on accuracy.

ggerganov · 2024-05-25T08:42:48Z

Thanks. I checked Azure Cloud to see if I can rent a node that supports Arm SVE and it seems soon there will be VMs available: https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/general-purpose/dpsv6-series?tabs=sizebasic
These VMs are currently in preview - when they become generally available, we can add ggml-ci for that instruction set

JohannesGaessler · 2024-05-26T08:24:40Z

I don't understand why, but after this PR I was having build issues on one of my machines when using make where the GPU could not be detected to determine the correct CUDA arch for -arch=native even though there was no change to the Makefile. However, this seems to have been related to ccache since the compilation worked with LLAMA_NO_CCACHE; deleting ~/.cache/ccache has permanently fixed the issue for me.

msy-kato · 2024-05-27T00:50:30Z

@ggerganov It's greate. Thank you for sharing information. If there is anything I can do to help with CI/CD for SVE implementation, I would like to contribute!

github-actions bot added build Compilation issues ggml changes relating to the ggml tensor library for machine learning labels May 21, 2024

mofosyne added the Review Complexity : High Generally require indepth knowledge of LLMs or GPUs label May 21, 2024

msy-kato added 2 commits May 23, 2024 19:46

Add SVE support for q4_0_q8_0 q8_0_q8_0

19531ac

remove ifdef

d28bfd5

msy-kato force-pushed the feat-sve-q4_0_q8_0-q8_0_q8_0 branch from d671a17 to d28bfd5 Compare May 23, 2024 11:01

ggerganov approved these changes May 23, 2024

View reviewed changes

ggerganov merged commit faa0e69 into ggml-org:master May 25, 2024

nivibilla mentioned this pull request Jun 25, 2024

Bug: abort on Android (pixel 8 pro) #8109

Open

Vithulep mentioned this pull request Sep 3, 2024

Implemented vector length agnostic SVE using switch case for 512-bit, 256-bit, and 128-bit Vector lengths #9290

Merged

4 tasks

fj-y-saito mentioned this pull request Jan 14, 2025

ggml: aarch64: implement SVE kernels for q4_K_q8_K vector dot #11227

Merged

Vithulep mentioned this pull request Feb 17, 2025

ggml: aarch64: implement SVE kernels for q3_K_q8_K vector dot #11917

Merged

Vithulep mentioned this pull request Feb 25, 2025

ggml: aarch64: implement SVE kernels for q2_k_q8_k vector dot #12064

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

ggml: aarch64: implement SVE kernels for q8_0_q8_0, q4_0_q8_0 vector dot#7433

ggml: aarch64: implement SVE kernels for q8_0_q8_0, q4_0_q8_0 vector dot#7433
ggerganov merged 2 commits intoggml-org:masterfrom
msy-kato:feat-sve-q4_0_q8_0-q8_0_q8_0

msy-kato commented May 21, 2024 •

edited

Loading

Uh oh!

github-actions bot commented May 21, 2024 •

edited

Loading

Uh oh!

ggerganov left a comment

Uh oh!

msy-kato commented May 24, 2024 •

edited

Loading

Uh oh!

ggerganov commented May 25, 2024

Uh oh!

JohannesGaessler commented May 26, 2024

Uh oh!

msy-kato commented May 27, 2024 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Comments

Conversation

msy-kato commented May 21, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Q4_0_Q8_0

Q8_0_Q8_0

Uh oh!

github-actions bot commented May 21, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ggerganov left a comment

Choose a reason for hiding this comment

Uh oh!

msy-kato commented May 24, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ggerganov commented May 25, 2024

Uh oh!

JohannesGaessler commented May 26, 2024

Uh oh!

msy-kato commented May 27, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

msy-kato commented May 21, 2024 •

edited

Loading

github-actions bot commented May 21, 2024 •

edited

Loading

msy-kato commented May 24, 2024 •

edited

Loading

msy-kato commented May 27, 2024 •

edited

Loading